• How much do offensive aspects in baseball affect a player’s salary?
• How much does Age affect a players salary?
• Which team generates the most RBIs?
• Which position in baseball makes the most money?
In this analysis I aimed to explain an MLB player’s salary based on their offensive statistics. This data set contains 478 observations of players and their offensive statistics and their respective salary from the year 2022. After using AIC variable selection and graphical analysis, this study found that RBI and Age contribute most to a players salary. Using OLS regression, I found that RBI and Age are statistically significant on all levels and a reported \(R^2 = 0.34342\).
My findings are: :D
Start: AIC=13238.3
Salary ~ 1
Df Sum of Sq RSS AIC
+ HR 1 4.2867e+15 1.9343e+16 13157
+ RBI 1 4.1456e+15 1.9484e+16 13160
+ BB 1 3.8543e+15 1.9775e+16 13166
+ Age 1 3.6114e+15 2.0018e+16 13171
+ SO 1 2.6136e+15 2.1016e+16 13191
+ OPS 1 1.8410e+15 2.1788e+16 13206
+ BA 1 1.3293e+15 2.2300e+16 13216
+ SB 1 4.7420e+14 2.3155e+16 13232
<none> 2.3629e+16 13238
Step: AIC=13156.63
Salary ~ HR
Df Sum of Sq RSS AIC
+ Age 1 4.1526e+15 1.5190e+16 13058
+ BB 1 2.8801e+14 1.9055e+16 13152
+ RBI 1 1.3198e+14 1.9211e+16 13156
<none> 1.9343e+16 13157
+ SO 1 8.4287e+12 1.9334e+16 13158
+ BA 1 4.8628e+12 1.9338e+16 13158
+ OPS 1 1.6593e+11 1.9343e+16 13159
+ SB 1 8.9588e+10 1.9343e+16 13159
Step: AIC=13057.61
Salary ~ HR + Age
Df Sum of Sq RSS AIC
+ BB 1 2.7845e+14 1.4912e+16 13052
+ RBI 1 1.9537e+14 1.4995e+16 13054
+ SB 1 1.4568e+14 1.5044e+16 13056
<none> 1.5190e+16 13058
+ BA 1 4.0571e+13 1.5150e+16 13058
+ OPS 1 1.4568e+13 1.5176e+16 13059
+ SO 1 7.9821e+11 1.5189e+16 13060
Step: AIC=13051.87
Salary ~ HR + Age + BB
Df Sum of Sq RSS AIC
<none> 1.4912e+16 13052
+ SB 1 6.9871e+13 1.4842e+16 13052
+ SO 1 6.3024e+13 1.4849e+16 13052
+ RBI 1 4.3889e+13 1.4868e+16 13053
+ OPS 1 7.7980e+12 1.4904e+16 13054
+ BA 1 3.5841e+10 1.4912e+16 13054
Call:
lm(formula = Salary ~ HR + Age + BB, data = mlb_data)
Residuals:
Min 1Q Median 3Q Max
-18239809 -3504679 -825701 1667401 29570560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24199640 2466349 -9.812 < 2e-16 ***
HR 236228 47918 4.930 1.19e-06 ***
Age 895247 83473 10.725 < 2e-16 ***
BB 59629 21446 2.780 0.00568 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6002000 on 414 degrees of freedom
Multiple R-squared: 0.3689, Adjusted R-squared: 0.3644
F-statistic: 80.68 on 3 and 414 DF, p-value: < 2.2e-16
:D :) :] :3
---
title: "EDA for MLB Salary"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: default
navbar-bg: "#FFD580"
orientation: columns
vertical_layout: fill
source_code: embed
---
<style>
.chart-title
{
/* chart_title */
font-size: 18px;
}
body
{
/* Normal */
font-size: 16px;
}
</style>
```{css}
/* Set font color of inactive tab to green */
.nav-tabs-custom .nav-tabs > li > a
{
color: green;
}
/* Set font color of active tab to red */
.nav-tabs-custom .nav-tabs > li.active > a
{
color: red;
}
/* To set color on hover */
.nav-tabs-custom .nav-tabs > li.active > a:hover
{
color: purple;
}
```
```{r setup, include=FALSE}
library(flexdashboard)
```
Overview
===
Column {data-width=650}
-----------------------------------------------------------------------
### Research Questions/Purpose
• How much do offensive aspects in baseball affect a player's salary?
• How much does Age affect a players salary?
• Which team generates the most RBIs?
• Which position in baseball makes the most money?
In this analysis I aimed to explain an MLB player's salary based on their offensive statistics. This data set contains 478 observations of players and their offensive statistics and their respective salary from the year 2022. After using AIC variable selection and graphical analysis, this study found that RBI and Age contribute most to a players salary. Using OLS regression, I found that RBI and Age are statistically significant on all levels and a reported $R^2 = 0.34342$.
```{r}
```
```{r, data/packages}
library(pacman)
pacman::p_load(tidyverse, rvest, stringr,
ggplot2, forcats, writexl,
maps, viridis, scales, plotly)
mlb_salary <- read_csv("mlb_salary.csv")
#mlb_batting <- read_delim("C:\\Users\\altos\\Documents\\Datasets\\mlbb.txt", delim = ";")
mlb_batting <- read_csv("mlb_batting.csv")
#mlb_map <- read_csv("map_mlb.csv")
#colnames(mlb_map)[1] <- "Tm"
# Getting rid of observations where the playerwas traded, taking most recent trade
mlb_batting <- mlb_batting[!duplicated(mlb_batting$Name),]
mlb_batting <- mlb_batting[!mlb_batting$Tm == "TOT",]
# Cleaning name column
mlb_batting$Name <- gsub("\xa0", " ", mlb_batting$Name)
mlb_batting$Name <- gsub("<e9>", "e", mlb_batting$Name)
mlb_batting$Name <- gsub("#", "", mlb_batting$Name)
mlb_batting$Name <- gsub("<c1>", "A", mlb_batting$Name)
mlb_batting$Name <- gsub("<f3>", "o", mlb_batting$Name)
mlb_batting$Name <- gsub("<e1>", "a", mlb_batting$Name)
mlb_batting$Name <- gsub("<f1>", "n", mlb_batting$Name)
mlb_batting$Name <- gsub("<ed>", "i", mlb_batting$Name)
mlb_batting$Name <- stringr::str_replace(mlb_batting$Name, '\\*', '')
#mlb_salary$Player <- sub("(\\w+),\\s(\\w+)","\\2 \\1", mlb_salary$Player)
#colnames(mlb_salary)[1] <- "Name"
#mlb_salary <- subset(mlb_salary, select = -c(Average_Annual, Total_Value, Years))
#mlb_data <- merge(x = mlb_batting, y = mlb_salary, by = "Name")
mlb_batting <- mlb_batting %>%
arrange(Name)
allnames <- unlist(str_split(mlb_salary$Player, ", "))
mlb_salary$First <- allnames[1:1942%%2==0]
mlb_salary$Last <- allnames[1:1942%%2==1]
mlb_salary$Name <- paste(mlb_salary$First, mlb_salary$Last)
mlb_salary$Name <- str_remove_all(mlb_salary$Name, "\\.")
mlb_batting$Name <- str_remove_all(mlb_batting$Name, "\\.")
mlb_salary <- mlb_salary %>%
arrange(Name)
mlb_salary$Name[which(mlb_salary$Name=="A Minter.J.")] <- "Alex Minter"
mlb_salary$Name[which(mlb_salary$Name=="A Puk.J.")] <- "Andrew Puk"
mlb_salary$Name[which(mlb_salary$Name=="D'Travis Arnaud")] <- "Travis d'Arnaud"
mlb_salary$Name[which(mlb_salary$Name=="Abraham Toro-Hernandez")] <- "Abraham Toro"
mlb_salary$Name[which(mlb_salary$Name=="AJ Pollock IV")] <- "AJ Pollock"
mlb_salary$Name[which(mlb_salary$Name=="Alexander Colome")] <- "Alex Colome"
mlb_salary$Name[which(mlb_salary$Name=="Mullins Cedric II")] <- "Cedric Mullins"
mlb_salary$Name[which(mlb_salary$Name=="Christopher Martin")] <- "Chris Martin"
mlb_salary$Name[which(mlb_salary$Name=="Jazz Chisholm")] <- "Jazz Chisholm Jr"
mlb_salary$Name[which(mlb_salary$Name=="LaMonte Wade")] <- "LaMonte Wade Jr"
mlb_salary$Name[which(mlb_salary$Name=="Lourdes Gurriel")] <- "Lourdes Gurriel Jr"
mlb_salary$Name[which(mlb_salary$Name=="Vladimir Guerrero")] <- "Vladimir Guerrero Jr"
mlb_batting$Name[which(mlb_batting$Name=="Michael A Taylor")] <- "Michael Taylor"
mlb_salary <- mlb_salary %>%
select(Name, Position, Salary)
mlb_data <- mlb_batting %>%
left_join(mlb_salary, key = c(Name))
mlb_data <- mlb_data %>%
filter(!is.na(Salary))
mlb_data[,"Age_Gp"] <- NA
mlb_data <- mlb_data %>%
mutate(Age_Gp = case_when(
Age >= 20 & Age < 25 ~ "20-25",
Age >= 25 & Age < 30 ~ "25-30", Age >= 30 & Age < 35 ~ "30-35",
Age >= 35 & Age < 40 ~ "35-40", Age >= 40 & Age < 45 ~ "40-45",
Age >= 45 & Age < 50 ~ "45-50"))
mlb_data$Age_Gp <- as.factor(mlb_data$Age_Gp)
mlb_data[,"RBI_Gp"] <- NA
mlb_data <- mlb_data %>%
mutate(RBI_Gp = case_when(
RBI >= 0 & RBI < 30 ~ "0-30",
RBI >= 30 & RBI < 60 ~ "30-60", RBI >= 60 & RBI < 90 ~ "60-90",
RBI >= 90 & RBI < 120 ~ "90-120", RBI >= 120 ~ "120+"))
mlb_data$RBI_Gp <- as.factor(mlb_data$RBI_Gp)
avg_p <- mlb_data %>%
group_by(Position) %>%
summarise(Avg_RBI = mean(RBI), Avg_Sal = mean(Salary))
avg_tm <- mlb_data %>%
group_by(Tm) %>%
summarise(Avg_Sal = mean(Salary), Avg_RBI = mean(RBI))
avg_p$Position <- as.factor(avg_p$Position)
avg_tm$Tm <- as.factor(avg_tm$Tm)
#avg_tm <- avg_tm %>%
#left_join(mlb_map, key = c(Tm))
avg_p <- avg_p[-c(5,10),]
```
Results
===
Column {data-width=650}
-----------------------------------------------------------------------
### Findings/Statistical Significance
My findings are: :D
### Regression Summary
```{r results}
# Efficient Model
# fit.mlb <- lm(Salary ~ Age + RBI, data = mlb_data)
# summary(fit.mlb)
# AIC Model
fit.mlb2 <- lm(Salary ~ BB + Age + HR, data = mlb_data)
S2 <- summary(fit.mlb2)
```
```{r fig.align='center', out.width="80%"}
knitr::include_graphics("Linear.jpg")
```
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Boxplot of Salary vs Player's Age Group
```{r}
ggplot(mlb_data, aes(x = Age_Gp, y = Salary)) +
geom_boxplot(color = "#81d4fa", fill = "#03a9f4") +
labs(x = "Age Group") +
scale_y_continuous(labels = label_comma()) +
theme_classic()
```
### Boxplot of Salary vs. RBIs
```{r}
mlb_data %>%
mutate(class = fct_reorder(RBI_Gp, Salary, na.rm = T, .fun = 'median')) %>%
ggplot(aes(x = reorder(RBI_Gp, Salary), y = Salary)) +
geom_boxplot(fill = "#ffd9b3", color = "#ffa64d") +
labs(x = "Number of RBIs", y = "Salary") +
scale_y_continuous(labels = label_comma()) +
theme_classic()
```
Variable Selection
===
Column {data-width=650}
-----------------------------------------------------------------------
### AIC Variable Selection
```{r}
fit.null <- lm(Salary ~ 1, data = mlb_data)
fit.AIC <- step(fit.null, scope = Salary ~ Age + HR + RBI + SB + BB + SO + BA + OPS, direction = "forward", k = 2)
fit.a <- lm(Salary ~ HR + Age + BB, data = mlb_data)
summary(fit.a)
```
Column {data-width=650}
-----------------------------------------------------------------------
### Scatter plot Matrix of Salary vs RBI + Age
```{r}
pairs(~log(Salary) + Age + HR + BB, data = mlb_data)
```
### Scatter plot Matrix of log(Salary) vs RBI + Age
```{r}
pairs(~log(Salary) + Age + RBI, data = mlb_data)
```
Position Summary
===
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Position and Avg. Salary
```{r graph (stadium)}
devtools::install_github("bdilday/GeomMLBStadiums")
library(GeomMLBStadiums)
mlb_position <- as.data.frame(rep(NA, 7))
colnames(mlb_position)[1] <- "Position"
mlb_position[,"Latitude"] <- NA
mlb_position[,"Longitude"] <- NA
mlb_position[,"Avg_Salary"] <- NA
mlb_position$Avg_Salary <- c(7575904, 4294942, 9047747,
6954996, 3414633, 3330355,
5636252)
mlb_position$Position <- c("1st Base", "2nd Base", "3rd Base",
"Shortstop (SS)", "Pitcher", "Catcher",
"Outfield (OF)")
mlb_position$Latitude <- c(72,40,-72,-40,1,1,1)
mlb_position$Longitude <- c(85,115,85,115,70,-20,270)
p <- ggplot(mlb_position, aes(x = Latitude, y = Longitude, text = paste0(Position, ":\n", "Mean Salary: ", Avg_Salary))) +
geom_spraychart(stadium_transform_coords = TRUE, stadium_segments = "all", stadium_ids = "reds") +
coord_fixed() +
theme_void() +
theme(axis.title.x=element_blank(), axis.text.x=element_blank(),
axis.ticks.x=element_blank(), axis.title.y=element_blank(),
axis.text.y =element_blank(), axis.ticks.y=element_blank(),
panel.grid.major = element_blank(),
panel.background = element_blank()) +
geom_point(aes(size = Avg_Salary), color = "#FFD580")
font <- list(family = "Mono", size = 15, color = "black")
label <- list(bgcolor = "#FFD580", font = font)
ggplotly(p, tooltip = "text") %>%
style(hoverlabel = label) %>%
layout(font = font)
```
### Avg. Salary vs Position
```{r bar}
p2 <- ggplot(avg_p, aes(x = Position, y = Avg_Sal), text = ) +
geom_bar(stat = "identity", color = "orange", fill = "#FFA07A") +
scale_y_continuous(labels = label_comma()) +
labs(y = "Avg. Salary") +
theme_classic()
ggplotly(p2, tooltip = "text")
```
### Avg. RBI vs Position
```{r}
ggplot(avg_p, aes(x = Position, y = Avg_RBI)) +
geom_histogram(stat = "identity", color = "#FFA07A", fill = "#DE3163") +
labs(y = "Avg. RBI") +
theme_classic()
```
Column {data-width=550}
-----------------------------------------------------------------------
### Analysis
:D :) :] :3
Team Summary
===
Column {.tabset data-length=650}
-----------------------------------------------------------------------
### Bar Chart of Avg Salary vs. Team
```{r map, fig.align='center', out.width="80%"}
options(scipen = 10)
avg_tm$label <- paste0(avg_tm$Tm, ":\n",
"Average Salary: $", round(avg_tm$Avg_Sal))
p_salary <- ggplot(avg_tm, aes(y = reorder(Tm, Avg_Sal), x = Avg_Sal, text = label)) +
geom_bar(stat = "identity", fill = "#F1959B") +
labs(y = "Team", x = "Avg. Salary") +
theme_classic() +
scale_x_continuous(label= comma, limits=c(0, 13000000))
ggplotly(p_salary, tooltip = "text")
```
### Bar Chart of Avg. RBIs vs. Team
```{r map2, fig.align='center', out.width="80%"}
ggplot(avg_tm, aes(y = reorder(Tm, Avg_RBI), x = Avg_RBI)) +
geom_histogram(stat = "identity", fill = "#ff8164") +
labs(y = "Team", x = "Avg. RBI") +
theme_classic()
```
### Map
```{r}
```
Column {data-length=650}
-----------------------------------------------------------------------
### Analysis
```{r}
library(shiny)
tags$iframe(style="height:1000px; width:100%; scrolling=yes",
src="MTH 207 comments from Becky.pdf")
```